Theory and rationale of interpretable all-in-one pattern discovery and disentanglement system

In machine learning (ML), association patterns in the data, paths in decision trees, and weights between layers of the neural network are often entangled due to multiple underlying causes, thus masking the pattern-to-source relation, weakening prediction, and defying explanation. This paper presents a revolutionary ML paradigm: pattern discovery and disentanglement (PDD) that disentangles associations and provides an all-in-one knowledge system capable of (a) disentangling patterns to associate with distinct primary sources; (b) discovering rare/imbalanced groups, detecting anomalies and rectifying discrepancies to improve class association, pattern and entity clustering; and (c) organizing knowledge for statistically supported interpretability for causal exploration. Results from case studies have validated such capabilities. The explainable knowledge reveals pattern-source relations on entities, and underlying factors for causal inference, and clinical study and practice; thus, addressing the major concern of interpretability, trust, and reliability when applying ML to healthcare, which is a step towards closing the AI chasm.

capability of PDD in interpreting and tracking the input, throughput and output of the entire process.
We want to show high accuracy and precise interpretability, with statistical and functional support, across different types of data and problems. Since we conjecture that the AV-association Subgroups and DSU found on entities are indeed associated with known or unknown primary sources, we seek strong support for this conjecture from the Case Studies, backed by established knowledge or by new statistical, scientific and/or medical findings. In the meantime, we highlight the performance of PDD and compare it with that of other ML models. Finally, we expound in detail the algorithmic platform and process, which significantly reduces the time and space complexity and produces a compact, explainable knowledge representation for further exploration and decision-making.
In the new PDD paradigm, PDD discovers, from the DS, statistically significant disentangled patterns occurring on entities associated with distinct primary sources. Hence, it saves the considerable effort and time of feature engineering required by many existing models. We apply PDD on wCL to take full advantage of the ground truth, and we discover disentangled patterns from nCL, unbiased by classes, for error detection and correction. Together these yield an Auto-Error-Correcting module that improves class association and entity clustering. Hence, we remove users' concern that unnoticed anomalies and label discrepancies in the data may affect the results and decisions.
In the new paradigm, when a class label is explicitly given in an entity as an AV, it is considered, and proven in the process, as a primary source associated with the discovered disentangled pattern(s) on the entity. PDD takes it as the ground truth knowledge to obtain class association.
We refer to this as the Pattern-Class Association. In the meantime, PDD also obtains results from nCL, unaffected by the class label, to check the consistency between the discovered patterns and the implicit class label of the entities. When an entity contains pattern(s) of another class instead of the given class, PDD changes the class status and readjusts the class label. We call this Class Readjustment and denote the process or the results by Cra. After Cra confirmation, we take the final class status assigned to the entities for performance evaluation.
PDD also discovers rare groups in distinct DS without relying on class labels, obtaining new knowledge not given in the ground truth. Hence, in the new paradigm, PDD utilizes the knowledge given in the ground truth and in the meantime identifies and rectifies errors caused by unnoticed biases or label discrepancies, as well as rare groups/classes. The objective of these Case Studies is to use verifiable data and problems to exemplify and validate the conjecture and the PDD capabilities listed in Table II. In traditional ML, there is no direct way to identify class label discrepancies and rare new classes, and outliers, biases, imbalanced classes, or rare groups or patterns may be scattered unnoticed in the data. We therefore have to rely on k-fold cross-validation via training and testing to randomly distribute the anomalies and/or locate uneven groups in different runs for classifier evaluation.
Usually, comparative results are obtained to guide the fine-tuning of the classifier via feature engineering and parameter setting. Hence, trainers often need intensive effort to screen the data, or must use big data to minimize the effect of anomalies. (From here on, we use the term "anomaly" to include class label discrepancies and outliers.) Since PDD automatically uses only statistically connected AVs from DS, it does not need feature engineering or search methods. Furthermore, it can identify anomalies through the inconsistency check, make corrections if confirmed, and discover rare groups and/or imbalanced classes wherever they are in the data. Thus, PDD can obtain class-association results from both wCL and nCL and retain only the successful class association patterns/rules for class association. It can therefore solve small, imbalanced as well as big data problems. A complete classification and predictive analysis of PDD with Cra will be addressed in a separate paper. In this paper, we show the efficacy of PDD in detecting anomalies and using Cra to improve entity class-association, expounding its potential in more general predictive analysis.
To further validate the PDD capabilities listed in Table II (Main Text), we use six case studies: two on proteomic data, one on histopathological/cytopathological data, one on clinical data, and two on imbalanced classes with noise (one on thoracic surgical risk and another on directly verifiable synthetic data). In each case study, we first give a simple description of the dataset and the problem, and then the Knowledge Base obtained. In the Knowledge Base, we display the disentangled patterns possessed by each individual entity as well as the class status it attained, including anomaly, outlier and Cra. We then present the unsupervised entity clustering result of the experiment. Finally, we furnish additional explanations elaborating the reasons for, and the efficacy of, PDD in achieving its unique tasks. In the Main Text, we use icons in the figures, which are easy to follow. In the Supplement, most of the figures retain a format close to the output of the PDD computation program. We will keep the outputs in a more formal manner.

Case Study 1: Cytochrome C
The first example on APC of cytochrome C is described in the Main Text. We use this simple example with distinct biological ground truth to validate every capability listed in Table II and give explanations on how and why PDD can fulfill each task so that it will be much easier to interpret the results of more sophisticated data and problems.

Data and Problem
In bioinformatics, there is a need to identify and analyze local and co-occurring functional sites, elements and regions in bio-sequences. The Aligned Pattern Cluster (APC) is a unique approach we developed to discover and locate such regions, which conserve important functionality within and between bio-sequences [1] [2]. Supplementary Figure 1 shows an example of an APC and how it is obtained [1]. For analytic purposes, we treat an APC as a relational dataset with the aligned sites as attributes and amino acids as AVs. We used this dataset for its distinct taxonomic classes, presumed to be an important primary source attributed to the conserved amino acid patterns in the functional domain. Such premises can be verified later.

Supplementary Figure 1. An Aligned Pattern Cluster (APC). Based on the presumption that the discovered statistically significant association patterns imply conserved functionality, the amino acids aligned in columns represent functionally/structurally conserved sites that form the aligned pattern cluster, revealing the similarity/variation of the functional patterns in this conserved domain of the protein family [1]. Note that after alignment, each site (column) can be treated as an attribute and each item in the column as an AV. The top row gives the AV positions of the patterns in the APC, which can be traced back to their location in other settings. The notations C1, C2, and C3 represent the classes (known or unknown at the outset but made known later) of each sequence, denoted by the sequence ID.

There could be other primary sources indicating common functionality in the similar functional domain of different species, particularly in a set of related multiclass data. This example shows that PDD can find the primary sources of each class as well as those arising from common functional domains, reflected by sub-pattern(s) common to more than one class, since both are primary in that functional domain.
Such primary sources can be found in a hierarchical and automatic manner based on the intrinsic associations inherent in the data without relying on class labels. This is also a unique capability of PDD.
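The relational view of an APC described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the PDD implementation: the function name `apc_to_relational`, the toy sequences, the site positions, and the attribute naming scheme are all hypothetical.

```python
def apc_to_relational(aligned_rows, site_positions):
    """Turn aligned pattern rows into records of attribute-value pairs.

    Each aligned site (column) becomes an attribute, and each residue in
    that column becomes an attribute value (AV). A gap ('-') is stored as
    None. The attribute name "A<position>" is an assumed convention.
    """
    records = []
    for seq_id, class_label, residues in aligned_rows:
        if len(residues) != len(site_positions):
            raise ValueError(f"{seq_id}: expected {len(site_positions)} sites")
        avs = {
            f"A{pos}": (aa if aa != "-" else None)
            for pos, aa in zip(site_positions, residues)
        }
        records.append({"id": seq_id, "class": class_label, "avs": avs})
    return records


# Toy example: made-up sequences and positions, for illustration only.
rows = [
    ("S1", "C1", "CAQCH"),
    ("S2", "C2", "CSQCH"),
    ("S3", "C3", "C-ECH"),
]
table = apc_to_relational(rows, site_positions=[14, 15, 16, 17, 18])
```

Keeping the site position in the attribute name is one way to let a discovered pattern be traced back to its location in the original sequences.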

PDD Knowledge Base (Knowledge Base)
Supplementary Figure 2 represents the compact yet complete Knowledge Base obtained from the APC of cytochrome c: one set with class labels included in the dataset, denoted as wCL, and the other without, denoted as nCL. We use these notations to represent the datasets or the results PDD obtained from them. In wCL, the class label is treated as an additional AV for finding its association with other AVs. In wCL, we selected only patterns containing the class label as an AV to support class association.
Thus, we fully utilized the information provided by the ground truth. We call this Knowledge Base a ca-Knowledge Base (Class-Association Knowledge Base). It is just an excerpt of the Knowledge Base of wCL, containing the patterns with the class label as an AV, so that we give full weight to the given class label. Later we used the disentangled patterns obtained from nCL, unbiased by the ground truth (i.e., the class label), to spot inconsistencies between the discovered patterns and the implicit class label on each entity. From these inconsistencies, we could identify label discrepancies or other misinformation on the entities, regardless of whether they are located in a small or a big group. They are verifiable because of the transparency and interpretability of the patterns obtained from the Knowledge Base and Entity Clusters. The findings can be related to the source environment in the biological world for further exploration and confirmation.
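The relation between wCL, nCL and the ca-Knowledge Base can be illustrated with a short sketch. It assumes a pattern is represented as a set of (attribute, value) pairs and that the class label is stored under an attribute named `"class"`. Both are assumptions made for illustration, as are the function names and the toy patterns.

```python
CLASS_ATTR = "class"  # hypothetical name of the class-label attribute


def class_association_subset(patterns, class_attr=CLASS_ATTR):
    """Keep only the patterns that contain the class label as one of their AVs."""
    return [p for p in patterns if any(attr == class_attr for attr, _ in p)]


def strip_class_label(records, class_attr=CLASS_ATTR):
    """Build an nCL view by dropping the class-label attribute from each record."""
    return [{a: v for a, v in rec.items() if a != class_attr} for rec in records]


# Toy patterns, as if discovered from wCL (made up for illustration).
patterns_wcl = [
    frozenset({("A14", "C"), ("A17", "C"), ("class", "Mammal")}),
    frozenset({("A14", "C"), ("A18", "H")}),
    frozenset({("A16", "Q"), ("class", "Plant")}),
]
ca_kb = class_association_subset(patterns_wcl)
ncl_records = strip_class_label([{"A14": "C", "A17": "C", "class": "Mammal"}])
```

Here `ca_kb` keeps the two patterns that carry a class label, mirroring the excerpt described above, while `ncl_records` shows the same entity with the label removed.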

a) Knowledge Space Revealing AV-association Disentanglement
In the Knowledge Space from wCL, we found perfect disentanglement in the primary source columns of both the Summarized Knowledge Base and the Comprehensive Knowledge Base. Note that this compact set contains all the disentangled patterns discovered in the dataset. In wCL, the taxonomic class labels play an important role. As for pattern order (i.e., the number of AVs making up the pattern), in the Summarized Knowledge Base we displayed only the range of variation, using two notations: "4*" denotes that the pattern order in the DSU is 4 but with variation, while "5_8" implies that the order varies from 5 to 8.
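For readers who post-process the Summarized Knowledge Base, the two order notations can be parsed as sketched below. The `OrderSpec` type, the function name, and the acceptance of a plain integer such as "3" are assumptions of this sketch. Only the "4*" and "5_8" forms come from the text.

```python
import re
from typing import NamedTuple


class OrderSpec(NamedTuple):
    low: int       # smallest pattern order
    high: int      # largest pattern order
    starred: bool  # True for the "4*" form (order 4 but with variation)


def parse_order(notation: str) -> OrderSpec:
    """Parse a pattern-order entry such as "4*" or "5_8"."""
    text = notation.strip()
    match = re.fullmatch(r"(\d+)\*", text)
    if match:
        n = int(match.group(1))
        return OrderSpec(n, n, True)
    match = re.fullmatch(r"(\d+)_(\d+)", text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError(f"invalid order range: {notation!r}")
        return OrderSpec(low, high, False)
    if re.fullmatch(r"\d+", text):  # plain integer: an assumed extra form
        n = int(text)
        return OrderSpec(n, n, False)
    raise ValueError(f"unrecognized order notation: {notation!r}")


spec_star = parse_order("4*")
spec_range = parse_order("5_8")
```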
In the nCL (Figure 3 in nCL). This indicates that PDD can discover the intrinsic association. In nCL, where the class label does not affect the pattern association, PDD can discover not only AV-associations associated with the primary sources related to the classes (according to the implicit class labels), but also those associated with functionality (via sub-pattern(s)) common to more than a single class. For instance, from the first row in the DSU[1 1 1] in the Comprehensive Knowledge Base For a multi-class problem like this one, certain primary sources of entities in a DSU like DSU[1 1 1] could come from the common functionality among classes, but their primary sources related to distinct classes can still be found in some disentangled space or via further disentanglement (Supplementary Figure 2(c)). This case study shows the unique hierarchical disentanglement capability of PDD, which can automatically disentangle patterns from entities associated with distinct classes as well as with functionality common among classes, without relying on outside clues/guides. It supports our premise that a primary source implies something of functional importance to the entities associated with the specific AV-associations/patterns discovered by PDD. This is what PDD has revealed in both the cytochrome c and class A scavenger receptor data. Such revelation provides interpretability and guidance for further exploration.
EID84 is an interesting case. It contains a second order pattern of [FG … A] when Fungus is the class label and A occurs only in Fungus and Insect. It contains no other pattern. Hence in wCL it is classified as FG. However, in nCL where no class label exists, it is found to pertain to the group with the implanted rare pattern. Then our implantation is the primary source. Hence EID84 is both a Fungus and contains a pattern from another source related to the implanted rare pattern. PDD is able to offer an in-depth analysis of this case.

b) Pattern Space for Pattern Analysis and Interpretability
In the Summarized Knowledge Base, the union pattern in a DSU is represented by the union of the

c) Entity Space for Class/Group Association Supporting Anomaly Detection and Correction.
In the entity space of the Comprehensive Knowledge Base (Supplementary Figure 2), an entity pertains to one of the following class statuses:
1. Cor (correctly classified) if the patterns discovered in compliance with its given class label have "support" (i.e., the total number of patterns of a class label that the entity possesses) exceeding that of other classes.
2. Inc (incorrectly classified) if otherwise.
3. OL (outlier) if it contains no statistically significant patterns.
4. Cra if it possesses no pattern of its explicit or implicit class label but that of another class, with its class label readjusted to the confirmed class.
5. Und (undecided) if the entity has equal support from different classes.
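The class statuses above can be read as a small decision procedure over per-class support counts. The sketch below is one possible reading, not the PDD program itself. The function name and the dictionary input are hypothetical. A tie among other classes is treated as Und, which is an assumption, and the confirmation step for Cra via entity clustering is not modeled.

```python
def class_status(given_label, support):
    """Return (status, label) for one entity.

    `support` maps each class to the number of statistically significant
    patterns of that class found on the entity.
    """
    counts = {c: n for c, n in support.items() if n > 0}
    if not counts:
        return ("OL", given_label)  # no significant pattern at all
    own = counts.get(given_label, 0)
    others = {c: n for c, n in counts.items() if c != given_label}
    best_other = max(others.values(), default=0)
    if own == 0:
        # Only patterns of other classes: a candidate class readjustment.
        top = [c for c, n in others.items() if n == best_other]
        # Assumption: a tie among other classes is left undecided.
        return ("Cra", top[0]) if len(top) == 1 else ("Und", given_label)
    if own > best_other:
        return ("Cor", given_label)
    if own == best_other:
        return ("Und", given_label)
    return ("Inc", given_label)


# Toy usage with made-up support counts.
status_cra = class_status("Fungus", {"Plant": 2})
status_cor = class_status("Mammal", {"Mammal": 3, "Insect": 1})
```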
We should note that when the pattern(s) of an entity obtained in wCL are directly associated with the given class label, they show strong support for classifying the entity into that class unless its class label is questionable. If the entity is found to possess a pattern of another class and none of its given class, it is a Cra with its class label readjusted. Often, if mislabels or biases exist in wCL, PDD can discover the inconsistency more effectively in nCL, without the influence of the class label. Here we give examples of how PDD conducts the pattern-class consistency check on nCL, where the AV-associations were not influenced by class labels. In Figure 3(c) in Main, we observe that E31 was given the explicit class label Mammal and was found to be an OL in wCL, whereas in nCL, PDD found that E31 pertains to a rare group containing the implanted pattern [… T Y F …], together with two other entities, E60 and E84, with class labels Fungus and Insect respectively. This cluster was not found in wCL because each of its members is constrained by its originally given class label, but it was found in nCL, where the class labels were absent. This exemplifies that when the influence of the class label is removed, a rare subgroup not governed by the class label can be found. In another example, E73 was given the class label Fungus. However, it was found to be an OL in wCL and a Plant in nCL. That was confirmed in the Knowledge Base and Entity Clusters, where it possesses only Plant pattern(s). Therefore, to discover an unnamed or misnamed rare group or class label discrepancies, an unsupervised method must work together with the supervised method, as we propose in the all-in-one PDD system. This case study shows these and other capabilities of PDD listed in Table II {1-3, 5-12}.

Entity Clustering
Without specifying the number of clusters or setting any optimization criterion to direct the clustering, PDD obtained six clusters from DSUs (Figure 3(e) in Main) based on the disentangled patterns discovered. They were naturally separated, as they all came from statistically significant AV-association disentangled spaces (DSs). Since these entity clusters were obtained from hierarchical clustering based on the degree of AVs shared by entity pairs, we observed some variation, though none crucial; e.g., PDD automatically found two Mammal clusters in DSU[1 1 1] and DSU[1 1 2] and revealed their difference in S90 and S96 (Figure 3(c) in Main).
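To make "hierarchical clustering based on the degree of AVs shared by entity pairs" concrete, here is a generic sketch using average-linkage agglomerative clustering with Jaccard similarity. It is a swapped-in textbook technique, not the PDD procedure. In particular, the stopping threshold `min_similarity` is an assumption of this sketch, whereas PDD derives its clusters from the DSUs without such a setting. The entity data are made up.

```python
def jaccard(a, b):
    """Fraction of AVs shared by two entities (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(entity_avs, min_similarity=0.5):
    """Average-linkage agglomerative clustering on shared AVs.

    Merging stops when no pair of clusters has a mean pairwise Jaccard
    similarity of at least `min_similarity`, so the number of clusters
    is not fixed in advance.
    """
    clusters = [[eid] for eid in entity_avs]

    def linkage(c1, c2):
        sims = [jaccard(entity_avs[x], entity_avs[y]) for x in c1 for y in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = linkage(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_score < min_similarity:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy entities: each maps to the set of AVs in its discovered patterns.
entities = {
    "E1": {("A14", "C"), ("A17", "C"), ("A18", "H")},
    "E2": {("A14", "C"), ("A17", "C"), ("A18", "H"), ("A20", "T")},
    "E3": {("A30", "G"), ("A31", "P")},
    "E4": {("A30", "G"), ("A31", "P"), ("A33", "Y")},
}
clusters = agglomerate(entities)
```

In this toy run, E1 and E2 share three of four AVs and E3 and E4 share two of three, while the two pairs share nothing, so two clusters emerge without the count being specified.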

Discussion
As PDD automatically corrects class label discrepancies, it naturally places the corrected entities into their right cluster based on the disentangled patterns they possess in the DSU, regardless of what class label is given or discovered. For example, in this experiment, E73 was labeled as a Fungus but discovered to be a Plant and was placed into the Plant cluster. Hence, it was marked as a Cra and considered correctly placed. However, E70 was labeled as a Fungus but found to be an OL in the Knowledge Base and placed into the Insect cluster. We consider it misplaced. As shown in the Summary at the bottom of the table, PDD obtained an accuracy of 97.82% before Cra and 98.91% after.
After organizing and displaying the transparent Knowledge Base, PDD also made the entity clustering transparent (Figure 3(e) in Main), as it automatically partitioned the entities into eight clusters from the DSUs and revealed the disentangled patterns/pattern-clusters on each. Finally, we note that all the information in the Knowledge Base and Entity Clusters, including anomaly detection and correction, was obtained by PDD on nCL with the class label readjusted from the integrated results of wCL, in an all-in-one process. The results were automatically synchronized.
In this dataset, using taxonomic classes as presumed primary sources, PDD obtained consistent and unifying results even for such a small dataset. It not only discovered patterns associated with primary sources such as the taxonomic classes, but also discovered sources associated with functionality common to different groups, such as to Mammal, Fungus and Insect in DSU[1 1 1] and to Plants and Fungus in DSU[2 1 2] (Supplementary Figure 2(b)). The results of all these tasks exemplify and validate the PDD capabilities listed in Table II {4-12}. Although the tasks PDD accomplishes are not the same as those accomplished by other ML models, its high accuracy in class association and entity clustering matches that of the best of the existing methods (Figure 5 in the Main). In the meantime, all the results obtained are explainable and traceable. Therefore, they can be used for exploratory study jointly with other scientific methods. PDD has a unique role to play in providing statistical support and explainable insights.

Case Study 2: Class A Scavenger Receptors (SR-A)

Data and Problems
In the second case study, we used another verifiable APC dataset, obtained from the Class A Scavenger Receptors (SR-A), with amino acids as distinct AVs and the subclasses [1] [3] as possible primary sources. SR-A is a diverse family of proteins characterized by their ability to bind modified lipoproteins [3]. Although the 5 members (Marco, SRA, Scara3, Scara4, Scara5) [1] [3] of this family can all bind modified lipoproteins, they differ in sequence pattern, location, structure, and therefore function (Supplementary Figure 5). For instance, within the same family, protein length varies from 451 to 732, with the functional domains residing in different sequence locations (Supplementary Figure 5). Thus, SR-A is a protein family with conserved yet diverse functional subgroups, ideal for validating the capabilities of PDD in handling multiple classes and relating the findings to the proteomic real world.
The APC of SR-A contains 95 samples and 12 attributes [1] [3]. This receptor has five distinct classes (Marco, Sra, Scara3, Scara4, and Scara5) located in five different functional domains: Cytoplasmic, Collagenous, Transmembrane, α-helical and coiled-coil motifs (Supplementary Figure 4(b)). Since obtaining subclass classification and locating functional domains have become important in proteomic study and in fighting disease, this dataset was used to test whether PDD can fulfill such demands.

PDD Knowledge Base
Supplementary Figure 3 gives the Summarized Knowledge Base obtained from the APC of SR-A and a subset associated with DSU[1 1 1]. Since SR-A is a multi-class problem, we use this case study to show how PDD handles it. As we shall see, the information from both the Knowledge Base and the Entity Clusters is synchronized from the same run, with interpretable results fulfilling the all-in-one capability of PDD as listed in Table II. Supplementary Figure 3(b) is the Summarized Knowledge Base displaying only patterns containing class labels. It is extracted from that in Supplementary Figure 3(a) and contains the patterns embodying explicit class labels as an AV, used to obtain entity class association. We also call it the Class Association Knowledge Base (caKnowledge Base). It exploits the given class label in entity class association. However, it could be biased by the given class label as well. Hence, after assigning the class status to each entity, we use the results obtained from nCL, which are not affected by the class label, as a consistency check to identify and readjust the entities (Cra) where label discrepancies were identified and confirmed. We displayed the nCL results in the second EID row at the bottom section of Supplementary Figure 3. For class association, we used only patterns in the Comprehensive caKnowledge Base, where each pattern is associated with a distinct class. The last EID row in the bottom Summary Section (Supplementary Figure 3(b)) shows the final class status obtained by PDD after integrating the results of the first and the second rows, obtained from the caKnowledge Base and the Knowledge Base from nCL. Here, we notice that most of the entities are correctly classified, as their discovered class label complies with the given class label (with the same light color-code) in the EID row above (the fourth row in the Entity Space). PDD also discovered 3 implanted OLs (E96, E97, E98). We considered them as correct but did not use them in the accuracy estimation.
Thus, we took 95 as the total and considered 3 other OLs as unclassified, arriving at an accuracy of 92/95 = 96.84%. The consistency check found that, among the 2 other OLs found in wCL, one (E92) was a Cra (it was found to contain a pattern of Scara4 instead), as also confirmed in the entity clusters (Supplementary Figure 4); and another (denoted as common) was found to contain patterns common to all classes. Hence the accuracy after Cra is 93/95 = 97.89%. This experiment shows that PDD can discover the class status of entities with multiple subclasses, a small OL group, and Cra to improve the class association. The patterns in the caKnowledge Base can be used as rules for a supervised classifier using the same approach of anomaly removal; this will be reported in another paper.
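The two accuracy figures quoted above follow from simple ratios, which the short check below reproduces. The helper name `accuracy_pct` is hypothetical. The counts (95 entities, 3 unclassified before Cra, one recovered after Cra) are taken from the text.

```python
def accuracy_pct(n_correct, n_total):
    """Class-association accuracy as a percentage, rounded to two decimals."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return round(100.0 * n_correct / n_total, 2)


total = 95                   # entities counted (implanted OLs excluded)
unclassified_before_cra = 3  # OLs treated as unclassified before Cra

before_cra = accuracy_pct(total - unclassified_before_cra, total)  # 92/95
after_cra = accuracy_pct(93, total)                                # 93/95

print(f"before Cra: {before_cra}%  after Cra: {after_cra}%")
```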

Supplementary Figure 4 displays the PDD entity clustering results, synchronized with the Knowledge Base in the same run. In ML clinical applications today, obtaining synchronized results from different procedures is still a challenge. Synchronization does not mean that the results are identical. It implies that they are derived from the same algorithmic process. For this five-subclass problem, PDD obtained eight clusters: one for Marco, two for Scara3, one for Scara4, two for Sra and one consisting of only two samples (one of Scara3 and the other of Sra, corrected as Scara4). The Scara3 sample was also found to be an outlier. As this dataset with multiple subclasses shows, in the new PDD paradigm we do not rely on multi-objective optimization but obtain results through automatic and natural disentanglement of statistical distributions presumed to be governed by primary sources (the drivers of the underlying AV-associations, such as functionality/taxonomy).
Here we observe that sub-clusters are automatically formed based on inherent differences in the AV-association patterns. To cite an example, the samples of Scara5 (C2 and C8 in blue code) are broken into two subgroups. This is due to differences in the AVs at A240, A241, and A242 in a small group of three entities (E63, 66, 76). This indicates that PDD can identify and locate mutants fast, an important step in genomic/proteomic research.
To give a fuller picture of the clusters, Column 1 of Supplementary Figure 4 shows the associated class of each cluster based on the implicit class label of its majority members in the distinct DSU in the Knowledge Base of nCL. Here we found from nCL (not shown in the paper) five OLs (E22, 92, 94, 96, 97, 98). The first two were found in wCL as Sra (Column 6 of Supplementary Figure 4).
The consistency check found that 3 of them were implanted, so they were not counted as misclassified. One (E92), listed as Sra, contained only Scara4 patterns. So, it was considered a Cra, but it was not placed into any known cluster and was therefore considered misplaced. Another one (in blank), referred to as "common", contained a common sub-pattern shared by all subclasses. Hence the total number of misplacements came to three, giving an accuracy of 92/95 = 96.84%, very close to that obtained for the Knowledge Base (Supplementary Figure 3(a)). The synchronized results obtained from the Knowledge Base and Entity Clusters of the SR-A APC data, together with those from the Cytochrome C APC, validate PDD's capabilities as listed in Table II for solving multi-class problems.
The comparison results in Supplementary Figure 4

Discussion: Interpretability and traceability for Scientific Exploration
Because of the AV-association ground truth knowledge provided in this dataset [2], we use it to demonstrate the interpretability and traceability of PDD. Supplementary Figure 5 shows the capability of PDD in discovering and locating patterns associated with functional domains scattered in the sequences of the SR-A family (legend in Supplementary Figure 5). It displays the sequence positions in the five functional domains of the SR-A family, as shown in the last column and confirmed by the existing literature [2]. It is intriguing to note that from the disentangled patterns revealed in Supplementary Figures 3(a)

Case Study 3: Wisconsin Breast Cancer Dataset

Data and Problems
Detection of tumors at the earliest possible stage is of paramount importance for cancer treatment, and a missed diagnosis may lose critical time that the patient needs. In the past, most ML researchers used this data for testing and evaluating supervised and unsupervised classification. This paper exemplifies how PDD can obtain correct pattern-class associations and identify the missed diagnosis and the misdiagnosis in the borderline cases from this large dataset.
The Wisconsin Breast Cancer dataset [5] is a well-studied classical healthcare benchmark dataset taken from the UCI repository. It has 699 cases for discriminating between two possible classes: Benign, with 458 cases (65.5%), and Malignant, with 241 cases (34.5%).

PDD Knowledge Base
Since the Breast Cancer dataset consists of 699 cases, we present only the Summarized Knowledge Base for the dataset wCL, integrated with the results obtained from nCL. In this Knowledge Base, we display only several representatives of the correctly classified entities, but all the anomalies identified and rectified, to explain how they were handled. In the Pattern Space, we noticed that the AVs making up the patterns for the two classes are distinct, indicating the proficient use of the indicants in the dataset and the effectiveness of the discretization scheme. We also found that the patterns seldom show the "either this or that AV" case common in other data mining and pattern discovery models, except for one case in DSU[1 1 1] where the AVs of the either-or case are adjacent intervals. We also observed that the disentangled patterns in the bottom rows that do not contain the class label as an AV (with no label in the class label column) are subsets of the unions of the patterns that do contain the class label and occur on entities with the same implicit class label. This indicates that PDD can detect AV-associations associated with classes as primary sources, with and without the class label given.

Supplementary Figure 6. Knowledge Base for Breast Cancer Dataset with Class Label given. This is the Summarized Knowledge Base where the class label is used as a normal attribute. In this figure the class color codes for the Benign and Cancerous classes are green and red respectively. In each of the 5 DS in the Knowledge Space, the Benign and Cancerous classes are on opposite sides of the PC, as indicated by the second index ("1" or "2") in the DSU. The primary sources show superb disentanglement. In the Pattern Space, note that the sets of AVs making up the patterns for the two classes are distinct but with minor variation in the subgroups (AV-Subgroups). Note that the patterns with class labels are quite similar to those without. In the Entity Space, we show only a few representative entities of Benign, Cancerous and Outliers, in light green, red and grey shade respectively. The remaining ones were anomalies with readjusted class labels (denoted by Cra). The third row in the top section of the table shows the EIDs with a given class label in the class color-code. In the bottom section, Row A denotes the class status found from wCL and Row B that from nCL. Row C shows the result after the integration of wCL and nCL according to the class association rules.
Note that a few Cra were found in wCL. We replaced the class status Inc found from nCL with Cra. Since all the Cra were absent before, they were considered as mislabeled. Thus, with 12 OLs, 4 Und and 16 mislabeled, the class association accuracy before Cra is (699-12-3-16)/699=95.57%. After Cra and class status integration, only a single Und was retained, and we attained an accuracy of 696/699=99.57%, comparable with the best results for supervised ML.
In the Entity Space, we displayed only a few representative successfully classified cases: E1 and E458 among the 438 Benign cases, and E459 and E699 among the 237 Cancerous cases. The remaining entities were found to be anomalies either in wCL or nCL. For the wCL cases, the anomalies (outliers or mislabels) can be found from the patterns possessed by the entities, as indicated in the EID column in the figure. For example, E2, with implicit class label Benign, was found by PDD to contain only Cancerous patterns from DSU[3 2 1] and DSU[5 2 1], but no Benign pattern. Hence, it was considered a mislabel and its class label was readjusted to Cancerous (in darker red color, row A). This was confirmed later in entity clustering (Supplementary Figure 7) by the Cancerous patterns (in AVs with dark red shade) it possesses. Since this was confirmed in wCL as Cra, the misclassified status Inc (in blue color-code, row B) in nCL was dismissed, giving Cancerous as its final class status (row C). For E37, PDD found no significant pattern in its EID column in wCL and thus considered it an outlier (OL) (row A). However, since other information was obtained in nCL (row B) (not shown here) in the DSU associated with Benign, PDD retained its implicit class status as Benign. Based on the class association rules given in the Methodology, the final class status was assigned to each entity in row C. We list the rules again here as a direct reference for the readers.

The Entity Class-Association Rules
1. An entity is assigned as a Cor, an Inc or a Und when found from wCL, but its class label will be readjusted to Cra accordingly if found and confirmed in wCL/nCL. (Rationale: we give strongest weight to the ground truth unless label discrepancy is spotted).
2. An entity is an OL or an Und only if affirmed in both wCL and nCL but assumes the class status of the class label found in either.
Note that E1, E175, E194, E458, E459, E504, E617, E624 and E699 were considered as Cor by rule 1 since none of them is Cra. E37 and E487 were OLs found in wCL, but additional information in nCL showed that they were Benign and Cancerous respectively (Row B). Therefore, the final class label was assigned to them as Benign and Cancerous correspondingly (Row C).
With these rules, we obtained the integrated class association results as displayed at the bottom of the Entity Space. The accuracy rate in entity class association was 95.57% before Cra and 99.57% after. We will have a detailed discussion in the interpretability and traceability section after the entity clustering section.
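As a reading aid, the two rules can be sketched in code. This is only one possible interpretation under stated assumptions, not the PDD implementation: the `Finding` record, the `integrate` function and the order of the checks are hypothetical. In particular, the sketch assumes that a Cra confirmed in either wCL or nCL takes precedence, and that an OL/Und from wCL is overridden whenever nCL supplies a class.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Finding:
    """Result for one entity in one run (wCL or nCL); hypothetical record."""
    status: str           # "Cor", "Inc", "Und", "OL" or "Cra"
    label: Optional[str]  # class suggested by the entity's patterns, if any


UNRESOLVED = {"OL", "Und"}


def integrate(implicit: str, wcl: Finding, ncl: Finding) -> Tuple[str, Optional[str]]:
    """Return (final status, final class label) for one entity."""
    # Rule 1 (assumed reading): a confirmed class reassignment (Cra) found in
    # either run overrides the implicit label.
    for finding in (wcl, ncl):
        if finding.status == "Cra" and finding.label is not None:
            return "Cra", finding.label
    # Rule 2: OL/Und is kept only if both runs leave the entity unresolved.
    if wcl.status in UNRESOLVED and ncl.status in UNRESOLVED:
        return wcl.status, None
    if wcl.status in UNRESOLVED:
        # nCL supplies a class, so the entity takes that class status.
        return ncl.status, ncl.label or implicit
    # Otherwise the wCL result stands and the implicit label is retained.
    return wcl.status, implicit


# E2: labeled Benign, only Cancerous patterns in wCL (Cra); nCL said Inc
# (its nCL label is not used here).
print(integrate("Benign", Finding("Cra", "Cancerous"), Finding("Inc", None)))
# E37: outlier in wCL, Benign pattern found in nCL.
print(integrate("Benign", Finding("OL", None), Finding("Cor", "Benign")))
```

Under these assumptions, E2 ends up with the final class Cancerous and E37 with Benign, matching the row C assignments described in the text.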

(a)
Cluster #                                C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11  C12
Class labels                             B    B    C    C    C    C    C    C    C    B    C    B
Cluster size                             384  54   159  27   8    32   3    12   14   3    1    2
PDD results (majority with distinct CL)  378  52   139  24   8    31   3    11   14   3    1    2
(b)
Supplementary Figure 7. Entity Cluster on Wisconsin Cancer Dataset. a. The clustering results with no class label given in the dataset are displayed. Column 1 displays the class associated with each cluster, based on the implicit class labels of its majority entities, and the sample size of the subgroups therein. For example, cluster C1 contains 384 entities, with 378 in the subgroup Benign (B), 6 outliers (OL) and 1 Cra which was labelled as Cancerous but found possessing only Benign pattern(s). Column 2 shows the EIDs in the color-code of the final class status assigned in the Knowledge Base (Supplementary Figure 6). Columns 3 to 5 are the triple code of the DSU on which the entity in the cluster was found. Column 5 and Column 6 display the implicit class label and the final discovered/rectified class labels. The last column shows the entity cluster placement status: "Cor" denotes correctly placed and "Misplaced" denotes incorrectly placed. The Summary at the bottom of the figure shows the efficacious unsupervised results. The entity placement accuracy of 97.29% before Cra was estimated using the implicit class label of the entities being placed into the wrong clusters after PDD disentanglement and pattern discovery. That after Cra of 99.57% was estimated based on placement of the entities with the final class status. b. A summary of the cluster size against the number of its majority members pertaining to a distinct class/group. It shows that PDD can discover clusters of different sizes correctly, based on the statistical strength of their patterns, in an unsupervised setting.

Entity Clustering Results
In this case study with cancer cytopathological data, the entity clustering results provide strong evidence of the efficacy of PDD's unsupervised approach. Based upon AV-association disentanglement, it achieved accurate Entity Cluster placement, i.e., placing entities into the right clusters (Supplementary Figure 7), while displaying details of each entity and cluster with statistically significant patterns to render transparency and reliability in support of the discovery.
In Supplementary Figure 7(a), we displayed some representatives of the correctly placed entities, together with all anomalies found and Cra rectified, along with their implicit and discovered class labels.
For entity placement assessment, we used the final class status results from the Knowledge Base in the discovered pattern column. Here we give a brief description of the entity clustering results rectified by the class-association results from the Knowledge Base (Supplementary Figure 7(a)).
In the dataset, there are 699 cases with 458 pertaining to Benign (B) and 241 to Cancerous (C). In the first column, for each cluster, we displayed its cluster ID (#), the cluster size, and the size of each distinct group according to the final class status obtained from the Knowledge Base. For each cluster obtained, Supplementary Figure 7(b) summarizes the cluster size against the number of its majority members pertaining to a distinct class/group. From the Entity Clustering Results, we notice that:
a) PDD obtained clusters all of which consist of majority members pertaining to a distinct class, implying that the overall clustering result is correct.
b) The number of clusters was determined automatically based on the inherent disentangled patterns rather than the setting of optimization parameters.
c) The size of the clusters varied from 384 to one. The smallest cluster contains only a single entity since it had no cluster to join, indicating PDD's capability to discover groups with imbalanced group size, even one with a single rare case.
d) Entity cluster placement accuracy before Cra was estimated by taking the implicit class label given to the entities as ground truth. It was found to be 97.29%. That after Cra was estimated as 99.28%, where the final class status obtained from PDD after Cra was used instead. In Figure 5, we observe that PDD outperformed the existing models.

Discussion: Interpretability and Traceability for Scientific Exploration
PDD's unsupervised approach is not influenced by prior knowledge or other physical/human factors, and it removes or rectifies confirmed errors or biases that went unnoticed in the data. It can therefore unveil inherent and intrinsic information and provide succinct, transparent, reliable, interpretable, and traceable knowledge in the Knowledge Base obtained from wCL and nCL as well as in the Entity Clusters. The results in this case study and others demonstrate high accuracy in class association and clustering, as well as efficacious interpretability and traceability with statistical support and functional implication. As for result interpretation, for taxonomic data, taking classes as primary sources is clear-cut. However, for pathological and clinical data, precise labeling is a challenge due to variable factors such as the early stage of a disease/disorder and the existence of other factors/environments. Hence, we treat this case study as a statistical study within the scope of ML, with the capability of revealing borderline cases. Nevertheless, it unveils some intriguing results, suggesting cases of misdiagnosis, missed diagnosis and early diagnosis, a very important capability in cancer diagnosis and assessment as well as in clinical practice. It opens the door for further research.

Case Study 4: heart disease
Now we move on to a more variable dataset, especially among the patients with the Presence and Absence of heart problems. It is a health care benchmark dataset from the UCI repository [6] [7] containing 270 clinical records with 13 mixed-mode attributes in two possible classes: Absence or Presence (of heart disease), abbreviated as Abs and Prs. This represents a realistic interpretable clinical problem. We use it to illustrate the key capabilities of PDD and its special ways of dealing with anomalies and borderline cases. Figure 4 in Main presents the Knowledge Base obtained for the wCL with entity class association results integrated with those obtained from nCL. Again, in the Entity Space, we displayed only the anomalies and the representatives of the correctly classified cases. In this clinical data, we observed that three AVs in the Knowledge Base did not form high-order patterns though they are traits related to heart problems. This implies that they do not have strong statistical interdependence among themselves or with other AVs and/or strong statistical association with the presence or absence of heart disease. In this Case Study, PDD did not use them and yet was able to obtain high class association accuracy. We still retained them in the pattern space, linked to the entity space, to allow further reference for clinical judgement and treatment.

Identification and Correction.
Patterns are high-order statistically significant associations. In clinical data, we often find single individual traits which are associated with disease classes but not much correlated with other factors. Such AVs do not form association patterns with other AVs in the data but are important by themselves for other clinical judgement in patients' care and treatment, like the 3 AVs (rbp, sc and fbs) in Figure 4 in Main. Figure 4 in Main shows strong evidence that DSUs and patterns associated with distinct primary sources/classes can be found in a clinical data set where the boundary between disease and normal, such as the "Absence" and "Presence" of heart problems, is not as distinct as in the SR-A and the Breast Cancer cases. It also reveals borderline cases to alert further observation and judgement.

2) Knowledge Space of the Knowledge Base from wCL.
In the Knowledge Space of the summarized Knowledge Base obtained from wCL, we noticed clear AV-association disentanglement for the DSUs and the disentangled patterns, with or without class label (as unveiled in the class column in the Pattern Space). Those without class labels were associated with distinct primary sources/classes, as reflected by their pattern occurrences on entities with distinct implicit class labels and substantiated by the success of entity class association shown in the EID rows at the bottom section of the figure in the Entity Space. We observed that the AV-associations associated with the two classes were on opposite sides of the PC in the DS, containing distinct AVs on the same attributes in the disentangled patterns.

3) Pattern Space
In the Pattern Space, we observed the distinct patterns between these two groups. We found that the AVs rbp, sc and fbs do not form high-order patterns associated with classes, since their possession by patients is not necessarily statistically interdependent. We also found low-order associations for patterns with no class label as an AV.

4) Entity Space
In the Entity Space of wCL, we observed strong pattern-to-class associations. We found 141 correct classifications out of 150 (94.00%) among Abs and 115 out of 120 (95.83%) among Prs.
Like the results for the Wisconsin Breast Cancer case, we noticed that more subjects among the normal (Abs) were found having the disease (Prs) than diseased persons (Prs) found to be normal. The overall class-association accuracy from wCL was (141+115)/270=94.81%, and that from the integrated results turned out to be 100% after the Cra, which were statistically affirmed. By plotting the union of the disentangled patterns onto the entities in entity clusters, as we shall discuss later, we noticed that the rectified ones were mostly borderline cases as they contain both Abs and Prs patterns. When class labels (explicit or implicit) are given in the data, they help both the classification and the correction of discrepancies. Therefore, they may impact these cases to give higher class-association accuracy, even up to 100%, purely based on statistics. With transparency provided, such an accuracy can be further validated by tracing back to the patients' records or going through a closer examination. PDD brings in the alert, a step to assist clinical decision in general.
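The wCL accuracies quoted in this Entity Space discussion can be reproduced from the reported counts. The snippet below is a plain arithmetic check; the variable names are illustrative only.

```python
# Correctly associated entities and class sizes as reported for wCL.
correct = {"Abs": 141, "Prs": 115}
total = {"Abs": 150, "Prs": 120}

# Per-class accuracy and overall accuracy over all 270 records.
per_class = {c: correct[c] / total[c] for c in total}
overall = sum(correct.values()) / sum(total.values())

for c, acc in per_class.items():
    print(f"{c}: {acc:.2%}")
print(f"overall: {overall:.2%}")
```

The overall value, 256/270, is the 94.81% figure given for wCL before the integration with nCL and Cra.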

Entity Clustering
The heart disease data exemplifies the capability of relating entity clusters to the real world. Figure 4(b) in Main shows the abridged results of Entity Clusters obtained without the influence of the class label. Each row is an entity with a distinct EID. The columns in the table follow the convention of our previous entity clustering results. All the entities associated with a distinct DSU were found to belong to a distinct cluster pertaining to the class of its majority members. In PDD, it is the AV-association disentanglement that separates clusters. Hence, it does not require setting the number of clusters or finding optimal or fuzzy cluster configurations. The hierarchical clustering simply breaks a larger group into smaller groups based on the degree of overlapping of S-connected AVs/patterns. They share considerable similarity. The DSU triple code reveals which clusters are similar to or distinct from each other. Clusters whose first two codes are identical are similar. If their second code is different, they are on opposite sides of the PC in a DS and thus distinct from each other. In PDD, entities form a cluster based on the patterns they possess. Their union pattern in the associated DSU reveals the characteristic and the underlying primary source of the cluster. Column 3 of Figure 4(b) in Main lists the implicit (original) class label (in class-color code) of each entity, and Column 4 its final discovered class status, integrating the Cra results from both wCL and nCL. For E3, the implicit class label was Abs and it was found associated with Abs. It was considered correctly classified and thus correctly placed into the cluster (C1) associated with Abs. We denoted it as Cor in the entity placement column. E152, with a given implicit class label of Prs, was found possessing only statistically significant patterns of Abs and none of Prs. Hence, its placement in an Abs cluster was considered correct. E216 was labeled as Prs and was also found possessing patterns of Prs, but was placed into an Abs cluster. Therefore, it was considered misplaced. As for E260, it was labeled as Prs but readjusted to Abs while being placed into a Prs cluster. Hence, it was considered misplaced.
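The reading of the DSU triple code described above can be illustrated with a small sketch. It assumes that the first code identifies the DS/PC and the second the side of that PC; the function name `relate` and the label returned for codes from different DSs are hypothetical, since the text does not say how such clusters compare.

```python
from typing import Tuple

DSU = Tuple[int, int, int]


def relate(a: DSU, b: DSU) -> str:
    """Compare two DSU triple codes under the assumed reading."""
    if a[0] != b[0]:
        # Different first code: not covered by the stated rule (assumption).
        return "different DS"
    if a[1] != b[1]:
        # Same DS, different second code: opposite sides of the PC.
        return "opposite sides"
    # First two codes identical: similar clusters.
    return "similar"


print(relate((1, 1, 1), (1, 2, 1)))  # same DS, opposite sides
print(relate((1, 1, 1), (1, 1, 2)))  # first two codes identical
```

The second pair uses a made-up code (1, 1, 2) purely to exercise the "similar" branch.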
To give a reasonable entity cluster placement accuracy, we used the class label assigned to the entities before or after Cra as the base. The placement is considered as Cor if the entity with its assigned class label is placed into a cluster pertaining to the class of the assigned class label. Based on this simple rule, we found from the full table of entity clustering results the number of entities with assigned class labels being placed into the wrong clusters. As Supplementary Figure 8(b) shows, we found 54 entities misplaced before Cra and 36 after. Hence, we have an accuracy of 80% before and 86.67% after Cra.
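The placement rule and the two accuracy figures can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative, and the final label of E152 is taken to be Abs, which is an assumption inferred from the statement that its placement in an Abs cluster was considered correct.

```python
def placement(assigned_label: str, cluster_class: str) -> str:
    """Cor if the assigned label matches the class of the cluster."""
    return "Cor" if assigned_label == cluster_class else "Misplaced"


# (EID, implicit label, final label after Cra, class of the cluster)
examples = [
    ("E3", "Abs", "Abs", "Abs"),
    ("E152", "Prs", "Abs", "Abs"),
    ("E216", "Prs", "Prs", "Abs"),
    ("E260", "Prs", "Abs", "Prs"),
]
after_cra = {eid: placement(final, cluster) for eid, _, final, cluster in examples}
print(after_cra)


def placement_accuracy(n_entities: int, n_misplaced: int) -> float:
    """Fraction of entities placed into a cluster of their assigned class."""
    return (n_entities - n_misplaced) / n_entities


before_acc = placement_accuracy(270, 54)  # 54 misplaced before Cra
after_acc = placement_accuracy(270, 36)   # 36 misplaced after Cra
print(f"before Cra: {before_acc:.2%}, after Cra: {after_acc:.2%}")
```

With 270 records, 54 and 36 misplaced entities correspond to the 80% and 86.67% reported.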
When we plotted the discovered patterns on the AV cells on the entities in the entity cluster with darker color-codes of the discovered class, we noticed that most of the misplaced were borderline cases in the sense that they possess significant patterns of both classes. This may explain why the entity clustering results were different from the class-association results. We shall address this notion in the discussion section.

Discussion: Pattern Transparency, Class Status Association and Entity clustering Results
In traditional ML, classification usually adopts a k-fold cross-validation process to get the average performance for assessing and selecting the best rules for the classifier through fine-tuning feature engineering and parameter selection. Since in the traditional ML models, there is no way to identify and locate the anomalies and samples from uneven class distribution, the k-fold method is a good way to randomly distribute the samples and anomalies on the training and test sets to get the average performance. PDD discovers disentangled patterns disregarding where they are located in the data for class associations based on disentangled statistics, not relying on feature engineering or parameter tuning (in all the six case studies, we took the same set of parameters by default).
Hence, from the theoretical view and the experimental results of the six case studies, PDD, by and large, could identify and confirm the rectified anomalies and discover rare and imbalanced groups/classes, producing class-association rules that give high accuracy for class association in the Knowledge Base and for entity cluster placement, regardless of where the entities were placed in the dataset.
From Figure 4(a) in Main, we were surprised to find 100% class-association accuracy after Cra, whereas in the Entity Cluster we got only 86.67% accuracy after Cra, a significant drop. One clue we noticed from the superimposed patterns is that most of the misplaced entities in the entity clusters are borderline cases. Since in the Knowledge Base, PDD class-association exploited the class label given in both wCL and nCL, the class label could be a determining factor in those cases. For a problem where the borderline is fuzzy, an unsupervised method such as PDD, based on disentangled patterns with transparency and statistical support, may provide a less biased and interpretable approach for clinicians to watch and go deeper, particularly for some more subtle cases.
The display in the Knowledge Base and Entity Clusters shows the importance of the transparency of the detected patterns and of the statistical evidence to justify the rectification of the label discrepancies in both the Knowledge Base and the entity clustering results, a unique capability of PDD that helps to interpret and improve the quality of class association and clustering. It will change our view of class-association and cluster evaluation, as it offers a new way of adjusting anomalies before the final acceptance of the results. While still providing a decision criterion, PDD will help clinical decision-making on anomalous cases and assist research and organization of the discovered knowledge, which could be statistically and functionally confirmed.

Case Study 5: Thoracic Dataset - Imbalanced Class
To validate the capability of PDD for imbalanced classification, another practically useful thoracic dataset was employed. The dataset describes the surgical risk, originally collected at the Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer in the years 2007-2011 [8]. It is composed of 470 samples with 14 attributes. To simulate the target scenario without requiring much tweaking, the numeric attributes PRE4, PRE5 and age were removed. The target attribute (taken as class label) is Risk. There are 400 samples labeled as  Figure 8(a)). This shows that PDD can discover fewer patterns with specific associations to the classes to furnish easy interpretation. Furthermore, even with few patterns, PDD can reveal succinct and comprehensive characteristics (as exemplified in the synthetic case) of all given classes, even when the class distribution is imbalanced.

Entity Clustering
The Thoracic dataset exemplifies the predictive capability of PDD for imbalanced classes. The dataset consists of 470 samples with 14 attributes, but only 70 patients were labeled as "risk" and 400 as "no risk", making it quite imbalanced. Without any correction, for wCL, PDD obtained an association accuracy of 95.25% for "no risk", toward which the result is biased, and an accuracy of 68.57% for "risk", resulting in an average accuracy of 91.27% and a balanced accuracy of 82%. This shows that without any anomaly detection, 429 entities were clustered in groups consistent with the original class label of their clustered entities. Here, we only show the clustering results of the remaining 41 entities (less than 10%) in Supplementary Figure 9, and find that, disregarding their implicit labelling, they were correctly placed into clusters with patterns pertaining to the other class, as revealed by their patterns. Column 5 lists the implicit (original) class label of each entity and Column 6 its discovered class status obtained from the Knowledge Base. We highlight the patterns associated with Risk_F in green and the patterns associated with Risk_T in red.
Supplementary Figure 9 clearly shows that the entities in the groups DSU [1 1 1] and DSU [3 1 1] were covered by the patterns associated with Risk_F, but some of them (listed in Supplementary Figure 9) were labeled as Risk_T. Similarly, the entities in the group DSU [1 2 1] were covered by the patterns associated with Risk_T, but they were labeled as Risk_F. This result shows that PDD clustered these 41 cases according to the patterns they possess, which belong to a class other than the labeled class.
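The relation between the per-class, average and balanced accuracies quoted for the Thoracic dataset can be illustrated with a short calculation. The class sizes come from the text; the counts of correctly associated entities (381 and 48) are not stated in the source and are derived here from the reported percentages, an assumption that agrees with the 429 entities reported as consistent with their labels.

```python
# Class sizes are from the text; the correct counts are derived from the
# reported per-class accuracies (95.25% of 400 and 68.57% of 70).
total = {"no risk": 400, "risk": 70}
correct = {"no risk": 381, "risk": 48}

# Per-class accuracy, overall accuracy, and balanced accuracy
# (the unweighted mean of the per-class accuracies).
per_class = {c: correct[c] / total[c] for c in total}
overall = sum(correct.values()) / sum(total.values())
balanced = sum(per_class.values()) / len(per_class)

for c, acc in per_class.items():
    print(f"{c}: {acc:.2%}")
print(f"overall: {overall:.4f}, balanced: {balanced:.4f}")
```

The overall value is 429/470, roughly 0.913, and the balanced value is roughly 0.819, in line with the reported figures of about 91% and 82%. The gap between the two reflects the dominance of the "no risk" class.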

Case Study 6: Synthetic Dataset - Class Imbalance and Noise Tolerance
Problem and Data. In Case Study 2, we showed with the Class A Scavenger Receptor (SR-A) APC dataset that PDD significantly outperforms K-means in unsupervised clustering in all scores based on the taxonomic ground truth (around the 50% level for K-means vs the 90% level for PDD) (Supplementary Figure 4(b)). Similarly, for the clustering results of the Cytochrome C APC dataset, PDD also outperforms K-means (Supplementary Figure 10). We now design a verifiable experiment to explore the noise tolerance capability of PDD in comparison with other ML models. We did that in this case study by adding noise columns to the original clean and succinct Cytochrome C APC dataset.
PDD can. This further shows why PDD can reduce the effect of noise without feature engineering since it extracts statistically significant AV associations at even a deeper feature value level.